Effect of Instruction Fetch and Memory Scheduling on GPU Performance

نویسندگان

  • Nagesh B. Lakshminarayana
  • Hyesoon Kim
چکیده

GPUs are massively multithreaded architectures designed to exploit data level parallelism in applications. Instruction fetch and memory system are two key components in the design of a GPU. In this paper we study the effect of fetch policy and memory system on the performance of a GPU kernel. We vary the fetch and memory scheduling policies and analyze the performance of GPU kernels. As part of our analysis we categorize applications as symmetric and asymmetric based on the instruction lengths of warps. Our analysis shows that for symmetric applications, fairness based fetch and DRAM policies can improve performance. However, asymmetric applications require more sophisticated policies.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Petri Net Analysis of Non-Redundant and Redundant Execution Schemes

The quest for high-performance has led to multiand many-core systems. To push the performance of a single core to the limit, simultaneous multithreading (SMT) is used. SMT enables to fetch different instructions from different threads, hiding latencies in other threads. SMT also gives the opportunity to execute redundant threads (redundant multithreading, RMT) and thus to detect faults by compa...

متن کامل

Effective Instruction Prefetching In Chip Multiprocessors

threaded application performance, often achieved through instruction level parallelism per chip is increasing, the software and hardware techniques to exploit the potential of studies mostly involve distributed shared memory multiprocessors and fetching will not be fully effective at masking the remote fetch latency. the effective address of the load instructions along that path based upon a hi...

متن کامل

Trace Cache Performance

Instruction fetch mechanism is a performance bottleneck of a Superscalar Processor. Fetch performance can be improved with the aid of an instruction memory known as a Trace Cache. This paper presents analytical expressions, which describe instruction fetch performance of a Trace Cache microarchitecture. The instruction fetch rates predicted by the expressions differ by seven percent from the si...

متن کامل

A Compiler-driven Supercomputer

The overall prrformance of supercomputers is slow compared to the speed of their underlying logic technology. This discrepancy is due to several bottlenecks: memories are slower than the CPU, conditional jumps limit the usefulness of pipelining and pre-fetching mechanisms, and functional-unit parallelism is limited by the speed of hardware scheduling. This paper describes a supercomputer archit...

متن کامل

A Graph-based Model for GPU Caching Problems

A GPU is a massively parallel computational accelerator that is equipped with hundreds or thousands of cores. A single GPU can provide more than 4 Teraflops single precision performance at its peak, however, the maximum memory throughput of a GPU card is around 200 GB/s. Such a gap usually prevents GPU’s computation power from being fully harnessed. A cache is a layer in between GPU’s computati...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009